logo


Audience: Diverse Background


Time: 1 day workshop (6 hours)


Pre-Requisites: Prior experience with the python programming language is essential: this is not an Introduction to Python. Basic competency is assumed. If you have not use python before consider taking: Intro to Python (Data Science Campus) or data camp courses prior to attending.


Brief Description: Natural Language Processing is a sub-field of Artificial Intelligence. It is used for processing and analysing large amounts of natural language (linguistics). Some applications include search engines (Google), text classification (spam filters), identifying sentiments for a product (sentiment analysis), methods for discovering abstract topics in a collection of documents (topic modelling) and machine translation technologies. This is an Introduction to Natural Language Processing, and thus the main concepts are about cleaning, exploring datasets, and applying feature engineering techniques to transform text data into numerical data.



Aims, Objectives and Intended Learning Outcomes: This module will provide a introduction to the Natural Language Processing field using Python programming language. It covers some basic terminology, the process of ‘cleaning’ a dataset, exploring it and applying simple feature engineering techniques to transform the data. By the end of the module learners will understand and apply the necessary steps to ‘clean’, explore and transform their dataset in the appropriate order.


Dataset: Patent Dataset, Hep Dataset (High_Energy_Physics), Spam/Ham


Libraries: Before attending the course please make sure that you read the course instructions that you received.


Acknowledgements: Many thanks to Savvas Stephanides for joining me on a pair programming approach to create the function that performs the text preprocessing and for his code review. Many thanks to Joshi Chaitanya that has provided Hep Dataset and some of his code for this course, to Ian Grimstead and Thanasis Anthopoulos for providing the Patent Dataset, to Gareth Clews, Isabela Breton and Dan Lewis for reviewing the material and the code and Dave Pugh for lending the Regex material. Also thanks to everyone who attended the pilot course to provide feedback about the course.



Chapter 1: Topic Modelling

Approach: Use one algorithm for each topic and mention the rest and where they can find them.

Topic Modelling what, why, where, when, who, how, so what, advantages, disadvantages, limitations

1.1 What is Topic Modelling

Topic Modelling is a method for discovering topics in a document collection. Topic is a collection of dominant keywords. Looking at the keywords, you can identify the topic.

There are many algorithms used for Topic Modelling, such as latent semantic analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), LDA (Bayesian version of pLSA), lda2vec (extension of word2vec and LDA).

In this course we will only focus on Latent Dirichlet Allocation (LDA) from the gensim library, which seems to be the most popular one.

When using LDA, each document in a set of documents is considered as a mixtrure of various topics. These topics are assigned to the document via the LDA. The assumption is that the topic distribution has a sparse Dirichlet prior, which includes the intuition that documents include a small set of topics and topics include a small set of frequent words. Topics are identified based on the likelihood of having co-occurrent terms.

1.2 Building a simple LDA Model

The quality of the model depends on the text pre-processing, the choice of the algorithm, the number of topics we choose, the variety of the text topics and the parameter tuning.

Steps: change the Steps. make 3, 8, 3, 4, 5, 6, 6, 9, 10 , 11

  1. Create a dictionary using corpora.Dictionary

  2. Create the corpus and get the frequencies of the words using doc2bow

  3. Build and Train the LDA model using gensim.models.ldamodel.LdaModel

  4. View and Interpret Topics using print_topics

  5. Compute model perplexity using log_perplexity. Perplexity is a measure of how well the model can predict the observed values in our sample. We can estimate perplexity for different LDA models. It is considered that the model with the lowest Perplexity performs better in predicting the observed values.

  6. Compute model coherence using CoherenceModel. Coherence Score is a measure of how well the topic model can be humanly-intrepreted (= assessing the quality of the learned topics). A good model will provide coherent topics. The higher the coherence is, the better the topics are extracted.

  7. Visualize topics and keywords using pyLDAvis.gensim.prepare

  8. Find the optimal number of topics by building many LDA models with different values of number of topics and pick the one that gives the highest coherence value.

  9. Find the dominant topic in document

  10. Find the most representative document for each topic

  11. Find the topic distribution across documents


Chapter 2: Information Retrieval


Intended Learning Outcomes: By the end of Chapter 1 you will be able to:-

  • Define key terms in Information Retrieval (IR).

  • List at a high level of abstraction key steps in developing an IR application.

  • Describe how IR can be challenging

  • Describe an Inverted Index

  • Set up an inverted index for a document collection in Python using SCI Learn

  • Define 3 models used to build an IR application

  • Describe the Boolean Retrieval Model

  • Set up a Boolean Retrieval search over a document collection

  • Describe VSM approach to IR

  • Set up a VSM based IR program over a document collection

  • Describe Language Modelling approach to IR

  • Calculate maximum liklihood estimates for terms in a document collection.

  • Apply Linear Interpolation to query/document to determine at probability score for query/document.


2.1 What is Information Retrieval (IR)?


The meaning of the term Information Retrieval can be quite broad.

Every time you look up information to get a task done could be considered IR.

A useful definition given by Manning (2009)

IR is finding material (usually documents) of an unstructured nature (usually text) that satifies an information need from within larger collections (usually stored on computers)


Key Terms used in Information Retrieval


An information need is the topic about which the user desires to know more about.

A query is what the user conveys to the computer in an attempt to communicate the information need.

A document is an information entity the user wants to retrieve.

A document is relevant if the user perceives that it contains information of value with respect to their personal information need.

A collection is a set of documents.

A term is a word or concept that appears in a document or query

An index is a representation of information that makes querying easier



Information Retrieval vs Web Search

IR is more than web search

IR is concerned with the finding of (any kind of) relevant information

Up until a few decades ago, people preferred to get information from other people eg booking travel via a human travel agent, librarians to search for books, paralegals etc. It used to be an activity only a few people engaged in.

The world has changed, hundreds of millions of peope engage in information retrieval (IR) every data through web search. However many other cases of IR eg email search, searching your laptop, interrogating corporate knowledge bases are also commonplace examples of search.

Information retrieval has overtaken database retrieval as most information does not reside in database systems.



 

2.2 The Mechanics of Information Retrieval


 


 

2.3 The Central problem in IR


 


 


 

Related to the above are the following issues:

  1. Document and query Indexing
    How to best represent their contents
  2. Query Evaluation(or retrieval process)
    To what extent does a document correspond to a query.
  3. System Evaluation
    How good is a system ? Are the retrieved documents relevant(precision).
  4. Are all relevant documents retrieved (recall).
  5. Relevant documents need to be found very quickly from vast quantities of data (100’s billions pages in some cases).


 

Questions to tackle in retrieval

  • How is a document represented with the selected keywords ?
  • How are document and query representations compared to calculate a score ?


 

The task in information retrieval is this: we have vast amounts of information to which accurate and speedy access is becoming ever more difficult. One effect of this is that relevant information gets ignored since it is never uncovered, which in turn leads to much duplication of work and effort. With the advent of computers, a great deal of thought has been given to using them to provide rapid and intelligent retrieval systems. The idea of relevance has slipped into the discussion. It is this notion which is at the centre of information retrieval. The purpose of an automatic retrieval strategy is to retrieve all the relevant documents at the same time retrieving as few of the non-relevant as possible. An IR system should generate a ranking which reflects relevance.

Most search engines use bag of words to build retrieval models. The document is treated as a bag of words


 

2.4 Document Representation: The Inverted Index


 

Basic Concept: Each document is described by a set of representative keywords called index terms.
 

Assign a numerical weight to index terms


 


 

   


   


   


   


   


   

The above index is often represented as a dictionary file of terms with an associated postings file.
 

This inverted index structure is essentially without rivals as the most efficient structure for supporting ad hoc text search.


 

Diving into Code


inverted_index_example = ["He likes to wink, He likes to drink!", "He likes to drink, and drink, and drink.", "The thing he likes to drink is ink","The ink he likes to drink is pink","He likes to wink, and drink pink ink" ] 


def set_tokens_to_lowercase(data):
    for index, entry in enumerate(data):
        data[index] = entry.lower()
    return data


def remove_punctuation(data):
    symbols = ",.!"
    for index, entry in enumerate(symbols):
        for index2, entry2 in enumerate (data):
            data[index2] = re.sub(r'[^\w]', ' ', entry2)
    return data

def remove_stopwords_from_tokens(data):
       stop_words = set(stopwords.words("english"))
       new_list = []
       for index, entry in enumerate(data):
           no_stopwords = ""
           entry = entry.split()
           for word in entry:
               if word not in stop_words:
                    no_stopwords = no_stopwords + " " + word 
           new_list.append(no_stopwords)
       return new_list


inverted_index_example = remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(inverted_index_example)))

vectorizer = CountVectorizer()
inverted_index_vectorised = vectorizer.fit_transform(inverted_index_example)

#if u want to look at it
tdm = pd.DataFrame(inverted_index_vectorised.toarray(), columns = vectorizer.get_feature_names())
print (tdm.transpose())
##        0  1  2  3  4
## drink  1  3  1  1  1
## ink    0  0  1  1  1
## likes  2  1  1  1  1
## pink   0  0  0  1  1
## thing  0  0  1  0  0
## wink   1  0  0  0  1


   

The following can be said about the inverted index:-
  • It maps terms to the documents that contain them. It “inverts” the collection (which maps documents to the words they contain)
  • It permit us to answer boolean queries without visiting entire corpus
  • It is slow to construct (requires visiting entire corpus) but this only needs to be done once
  • It can be used for any number of queries
  • It can be done before any queries have been seen
 


   

Exercise:
 

  1. Set up an inverted index using Python that would be built for the following document collection. Basic preprocessing should also be undertaken on the data.

  Doc 1: New home sales top forecasts.
  Doc 2: Home sales rise, in July!
  Doc 3 Increase in home sales, in July.
  Doc 4 July new home sales rise.
 

  1. Write a Python function that will take 2 words (eg “home” and “sales”) and returns document(s) that contains both the words.

 

Optional Extra:

 

Find documents matching query “pink ink”

 

  1. Find document containing both words

  2. Both words has to be a phrase

 

We could have a bi-gram index

 


 

Bi-gram index issues:
  Fast but index size will explode
  What aboout trigram phrases
  What about proximity? “ink is pink”
 

A possible solution: Proximity Index

 

Term position is embedded to the inverted index

 

Called proximity/positional index
  Enables phrase and proximity search


   


   

Implement positional inverted index on data shown below.
  You need to save the following information in terms inverted lists:
  - term (pre-processed) and its document frequency
  - list of documents where this term occured
  - for each document, list of positions where the term occured within the document
 

Doc 1: breakthrough drug for schizophrenia
  Doc 2: new schizophrenia drug
  Doc 3: new approach for treatment of schizophrenia
  Doc 4: new hopes for schizophrenia patients


 

2.5 Taxonomy of Classical IR Models


 

For effectively retrieving relevant documents by IR strategies, the documents are typically transformed into a suitable representation. Each retrieval strategy incorporates a specific model for its document representation purposes.

A retrieval model specifies the details of:
  • Document representation
  • Query representation
  • Retrieval function: how to find relevant results
  • Determines a notion of relevance
 

In classical IR models a document is described as a set of representative keywords - index terms . Each term is assigned a numerical weight to determine relavance.


   


 

2.6 Classical IR Models: Boolean Retrieval


 

The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly GREP referred to as grepping through text, after the Unix command grep, which performs this process. However,searching through large collections (billions to trillions of words) is unacceptably slow. More flexible matching operations require ranked retrieval.

One alternative to linearly scanning is to index the documents in advance.

Suppose we record for each document – here a play of Shakespeare’s – whether it contains each word out of all the words Shakespeare used (INCIDENCE MATRIX about 32,000 different words). The result is a binary term-document incidence matrix, as in Figure. Terms that are indexed are usually words.


   


   

We can have a vector for each term, which shows the documents it appears in, or a vector for each document, showing the terms that occur in it. To answer the query Brutus AND Caesar AND NOT Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwise AND: 110100 AND 110111 AND 101111 = 100100.
  The answers for this query are thus Antony and Cleopatra and Hamlet.


 


 

Question

  1. Contruct the term-document incidence matrix for documents in the Python list below.
  2. What are the returned results for query

    1. drink AND ink AND NOT pink


 

Diving into Code



data = ["He likes to wink, He likes to drink!", "He likes to drink, and drink, and drink.", "The thing he likes to drink is ink","The ink he likes to drink is pink","He likes to wink, and drink pink ink" ] 

data = remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(data)))

binary_vectorizer = CountVectorizer(binary=True)
counts = binary_vectorizer.fit_transform(data)

#if u want to look at it
tdm = pd.DataFrame(counts.toarray(), columns = binary_vectorizer.get_feature_names())
tdm=tdm.transpose()
print (tdm)
##        0  1  2  3  4
## drink  1  1  1  1  1
## ink    0  0  1  1  1
## likes  1  1  1  1  1
## pink   0  0  0  1  1
## thing  0  0  1  0  0
## wink   1  0  0  0  1
def NOT(pterm): 
    for a in range(len(pterm)):
        if(pterm[a] == 0): 
            pterm[a] = 1
        elif(pterm[a] == 1): 
           pterm[a] = 0
    return pterm


term1 =  list(tdm.loc['drink'])
term2 = list(tdm.loc['ink'])
term3 =  NOT(list(tdm.loc['pink']))
terms = list(zip(term1, term2, term3))

vector= [terms[item][0] & terms[item][1] & terms[item][2]for item in range(len(terms))] 

for i in vector:
    if i == 1:
        print ("Document", vector.index(i), "meets search term criteria")
## Document 2 meets search term criteria


 

The Boolean retrieval model is a model for information retrieval in which we can pose any query which is in the form of a Boolean expression of terms, that is, in which terms are combined with the operators AND, OR, and NOT. The model views each document as just a set of words


The following can be said of the Boolean Retrieval Model:-
  • It can answer any query that is made up of boolean expressions
  • Boolean queries are queries that use and, or and not to join query terms
  • Views each document as a set of terms
  • It is precise - document matches conditions or not
  • Primary commercial retrieval tool for 3 decades
  • Many professional searchers (e.g., lawyers) still like boolean queries
  • You know exactly what you are getting
  • It does not have a built-in way of ranking matched documents by some notion of relevance
  • It is easy to understand. Clean formalism
  • It is too complex for web users
  • Incidence matrix is impractical for big collections
 

Exercise:


Consider these documents:
  Doc 1: breakthrough drug for schizophrenia
  Doc 2: new schizophrenia drug
  Doc 3: new approach for treatment of schizophrenia
  Doc 4: new hopes for schizophrenia patients

  1. Draw the term-document incidence matrix for this document collection
     
  2. What are the returned results for query
      schizophrenia AND drug


2.7 Classical IR Models: Vector Space Model

 

The representation of a set of documents as vectors in a common vector space is known as the vector space model. Every distinct word has one dimension.


Key idea: Documents and queries are vectors in a high-dimensional space.
 

Key issues:
 

• What to select as the dimensions of this space?
  • How to convert documents and queries into vectors?
  • How to compare queries with documents in this space?
 

The Vector Space Model assumes that
 

• the degree of matching can be used to rank-order documents;
  • this rank-ordering corresponds to how well a document satisfying a user’s information need
 

Steps in Vector Space Modelling

  • Convert the query to a vector of terms
  • Weight each component.
  • Consult the index to find all documents containing each term
  • Convert each document to a weighted vector
  • Query and documents mapped to vectors and their angles compared
  • Match the query vector against each document vector and sort the documents by their similarity
  • Similarity based on occurrence frequencies of keywords in query and document
  • Output documents are ranked according to similarity to query
 

Challenges
 

• Finding a good set of basis vectors.
  • Finding a good weighting scheme for terms, since model provides no guidance.
Usually variations on (length normalised) tf*idf   • Finding a comparison function, since again the model provides no guidance. Usually cosine comparison.
 

Comments on Vector Space Models
  • Simple, practical, and mathematically based approach
  • Lacks the control of a Boolean model (e.g., requiring a term to appear in a document)
 

Overall, Vector Space Models are hard to beat
 

Consider below documents and a query term
 

Document 1: Cat runs behind rat
  Document 2: Dog runs behind cat
  Query: rat
 

A term document matrix would be set up. This is a way is a way of representing documents vectors
  in a matrix format in which each row represents term vectors across all the documents and columns represent document vectors across all the terms.
 

Term weights are calculated for all the terms in the matrix across all the documents.
 

A word which occurs in most of the documents might not contribute to represent the document relevance whereas less frequently occurred terms might define document relevance. This can be achieved using a method known as term frequency - inverse document frequency (tf-idf) which gives higher weights to the terms which occurs more in a document but rarely occurs in all other documents, lower weights to the terms which commonly occurs within and across all the documents. Tf-idf = tf X idf
 

Similarity Measures: cosine similarity
 

Mathematically, closeness between two vectors is calculated by calculating the cosine angle between two vectors. The cosine angle between each document vector and the query vector is calculated to find its closeness. To find relevant document to the query term , the similarity score between each document vector and the query term vector by is calculated by applying cosine similarity . Whichever documents have a high similarity score will be considered as relevant documents to the query term.
 


 


 


 

Summary on VSM
     

 

Diving into Code

 

The IATI dataset will be used, further details on this dataset can be found here https://iatistandard.org/en/iati-standard/ The dataset used below is a subset which provides description on aid activity undertaken by various organisation in the aid sector around the world.



import operator
import pandas as pd
import re
import sklearn
from sklearn.decomposition import PCA
from sklearn import feature_extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk import ngrams
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import brown
from nltk.collocations import *
from nltk.corpus import webtext
import numpy as np
import random
import pickle
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity  


def set_tokens_to_lowercase(data):
    for index, entry in enumerate(data):
        data[index] = entry.lower()
    return data


def remove_punctuation(data):
    symbols = ",.!"
    for index, entry in enumerate(symbols):
        for index2, entry2 in enumerate (data):
            data[index2] = re.sub(r'[^\w]', ' ', entry2)
    return data

def remove_stopwords_from_tokens(data):
       stop_words = set(stopwords.words("english"))
       new_list = []
       for index, entry in enumerate(data):
           no_stopwords = ""
           entry = entry.split()
           for word in entry:
               if word not in stop_words:
                    no_stopwords = no_stopwords + " " + word 
           new_list.append(no_stopwords)
       return new_list
   
    
def stemming (data):
    st = PorterStemmer()
    for index, entry in enumerate(data):
        data[index] = st.stem(entry)
    return data
  
def read_data():
    raw_data_orig  = open("C:/IR Course/Adv -IR/IATI3.pkl","rb")
    raw_data_orig = pickle.load(raw_data_orig, encoding='iso-8859-1')
    raw_data_orig = raw_data_orig[raw_data_orig['description'].notnull()]
    return raw_data_orig


query ="climate change and environmental degradation"

def preprocess(pdf):
    for index, row in pdf.iterrows():
            row['description'] = " ".join(stemming(remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(row['description'].split(" "))))))
    return pdf

#preprocess documents
raw_data= preprocess(read_data())


#now preprocess query
query = " ".join(stemming(remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(query.split(" "))))))
rownames = raw_data["iati-identifier"]


#vectorise and get tfidf values
vectorizer = TfidfVectorizer()
vectorized_iati = vectorizer.fit_transform(raw_data["description"])
tdm = pd.DataFrame(vectorized_iati.toarray(), columns = vectorizer.get_feature_names())
tdm=tdm.set_index(rownames)

#now vectorise query
vectorized_query=vectorizer.transform(pd.Series(query))
query = pd.DataFrame(vectorized_query.toarray(), columns = vectorizer.get_feature_names())

# get cosine similarity

def cos_sim (pdf, qdf):
    f_similarity={}   
    for index, row in qdf.iterrows():
        for index2, row2 in pdf.iterrows():
             cos_result = cosine_similarity(np.array(row).reshape(1, row.shape[0]), np.array(row2).reshape(1, row2.shape[0]))
             f_similarity[index2] = round(float(cos_result),5)
    return f_similarity

cosine_scores=cos_sim (tdm, query)
#now rank
final_rank= sorted(cosine_scores.items(), key=operator.itemgetter(1), reverse=True)
final_rank = final_rank[0:5]
rownames = rownames.tolist()
unprocessed  = read_data()

for item in final_rank:
    if item[0] in rownames:
         
        print('IATI-IDENTIFIER {0} DESCRIPTION {1}'.format(item[0],unprocessed.iloc[rownames.index(item[0]),2])) 
        
## IATI-IDENTIFIER 41AAA-11960-005 DESCRIPTION UNOPS supports the Global Environment Facility (GEF) Small Grants Programme that helps protect poor, remote villages from the serious effects of climate change and environmental degradation. In an effort to support community-led initiatives, UNOPS efficiently channels direct grants to help communities cope with climate change, conserve biodiversity, protect international waters, reduce the impact of persistent organic pollutants, prevent land degradation, and adopt sustainable forest management practices.
## IATI-IDENTIFIER 41120-8599 DESCRIPTION vOverall Objective:Residents of cities in developing countries and their urban systems begin to become more resilient to the impacts and climate change, and reduce their carbon emissions.We intend to achieve this objectivethrough the followingExpected Accomplishments (EAs):EA1:The urban dimension is introduced into climate change agreements, strategies, policies, laws and regulations, and the climate change dimension is introduced into urban strategies, policies, laws and regulationsEA2:Urban decision-makers and stakeholders have increased capacity to reduce greenhouse gas emissions and adapt to climate change, and the institutions that build their capacities have adapted their teaching curricula and research accordinglyEA3:Cities participating in CCCI develop and begin to implement pro-poor strategies to adapt to climate change and embrace low carbon growth trajectoriesEA4:A global network of partners begins to advocate for policies to help cities better address climate change, access climate finance, and manage knowledge.The underlying development hypothesis here that relates the EAs to our Overall Objective is that together policy change and capacity building will lead to changes in behaviour. For example, policy reform at either the national or the local level (EAs Nos. 1 and 3, respectively), coupled with adequate institutional and human capacity (addressed in EA 2), will lead to actual reduced emissions from cities (part of Overall Objective). For further discussion of our strategy for implementing individual expected accomplishments, please see Section 2 of the CCCI Consolidated Strategy (May 2013), developed for discussion with the CCCI External Advisory Committee (available upon request).CCCI has followed roughly this logical framework for the past several years. However in recent years sharp cuts in annual replenishments, coupled with the increased visibility of cities in international processes with related demands on UN-Habitat, have tended to reduce resources available for city-level work (EA 3). 
## IATI-IDENTIFIER 41119-GW-REGULAR-S12-UNFPA DESCRIPTION UNFPA Guinea-Bissau regular-funded Activities to strengthen national capacity for production and dissemination of quality disaggregated data on population and development issues that allows for mapping of demographic disparities and socio-economic inequalities, and for programming in humanitarian settings activities implemented by UNFPA
## IATI-IDENTIFIER 41119-SS-OTHER-S10-NGO DESCRIPTION UNFPA South Sudan other-funded Activities to increase capacity to prevent gender-based violence and harmful practices and enable the delivery of multisectoral services, including in humanitarian settings activities implemented by NGO
## IATI-IDENTIFIER 41AAA-21461-001 DESCRIPTION Mejoramiento de la infraestructura de dos escuelas agrícolas


 

Exercise:

 

From the IATI10k.csv file, extract a sample of records (for example 100 rows) then do the following:
  1. Put the description column through pre-processing. Make a decision on what preprocessing routines would be suitable.
  2. Set up a suitable query to interrogate the document collection.
  3. Construct the inverted index with tf-idf scores. Ensure that the query is has also been converted to a vector with tf-idf scores.
  4.Compare the query vector with all other vectors in the document collection and calculate cosine similarity. Then store in a dictionary the iati-identifer field as a key with the cosine score as a value in a Python dictionary.
  5. Rank the dictionary by cosine scores (value field in the dictionary) and print the top 10 scores (sort in ascending value).
 

2.8 Classical IR Models: Probability based Information Retrieval

   


 

Use probability to determine relevance. How well does a document satisfy the query ?

An IR sytem has an uncertain understanding of the user query and makes an uncertain guess of whether a document satisifes the query.

Probability theory provides a principled foundation for such reasoning under uncertainty

The query and the documents are all observations from random variables. In the vector-based models, we assume they are vectors, but here we assume they are the data observed from random variables

And so, the problem of retrieval becomes to estimate the probability of relevance

In this category of models, there are different variants.
   

Probabilistic IR models at a glance

   

Classical probabilistic retrieval models


  Binary Independence Model
  Okapi BM25
  Bayesian networks for text retrieval
  Language model approach to IR
 

Probability Ranking Principle

 


 

Language Modelling Approach to Retrieval - Query Liklihood Retrieval Model

 

In query likelihood, our assumption is that this probability of relevance can be approximated by the probability of query given a document and relevance.
 

How do we compute this conditional probability?
 

This is where we build a Language Model.
 

What is a language model ?
 

** “The goal of a language model is to assign a probability to a sequence of words by means of a probability distribution” ** –Wikipedia

To understand what a language model, must know what is a:
 

• probability distribution
  • discrete random variable
 

In a unigram language model we estimate (and predict) the likelihood of each word independent of any other word
 

Defines a probability distribution over individual words


 

Sequences of words can be assigned a probability by multiplying their individual probabilities:

** P(university of north carolina) = P(university) x P(of) x P(north) x P(carolina) = (2/20) x (4/20) x (2/20) x (1/20) = 0.0001 **

There are two important steps in language modeling
  ‣ estimation: observing text and estimating the probability of each word
  ‣ prediction: using the language model to assign a probability to a span of text.
 

Unigram Language Model Estimation

 

General estimation approach:
 

‣ tokenize/split the text into terms
  ‣ count the total number of term occurrences (N)
  ‣ count the number of occurrences of each term (tft)
  ‣ assign term t a probability equal to
 

Document Language Models

  • Suppose we have a document D, with language model
  • We can use this language model to determine the probability of a particular sequence of text
  • How? We multiple the probability associated with each term in the sequence!
 

Example:-
 


 

Question:
 

What is the probability given by this language model to the sequence of text “rocky is a boxer” or “a boxer is a pet”?
 

To summarise how is the document model estimated for each document?
 


 

Query-Likelihood Retrieval Model: Some Examples

 

• Objective: rank documents based on the probability that they are on the same topic as the query
  • Solution:
  ‣ Score each document (denoted by D) according to the probability given by its language model to the query (denoted by Q)
  ‣ Rank documents in descending order of score
 


 

Every document in the collection is associated with a language model
  • Let denote the language model associated with document D
  • Think of a “black-box”: given a word, it outputs a probability
 

Let P(t|θD) denote the probability given by to term t
 


 

Question:
 

Which would be the top-ranked document and what would be its score?
 


 

P(q|M1) > P(q|M2)


 

Query-Likelihood Retrieval Model: Some Issues

 

There are (at least) two issues with scoring documents based on query terms
 

A document with a single missing query-term will receive a score of zero (similar to boolean AND)
  • Where is IDF?
  • Na attempt should be made to suppress the contribution of terms that are frequent in the document and also frequent in general (appear in many documents)?
   

Query-Likelihood Retrieval Model: Add One Smoothing and Linear Interpolation

   

• The goal of smoothing is to …
 

‣ Decrease the probability of observed outcomes
  ‣ Increase the probability of unobserved outcomes
 

Add One Smoothing
 


 

A more effective approach to smoothing for information retrieval is called linear interpolation
 

Let denote the language model associated with document D
  • Let denote the language model associated with the entire collection
  • Using linear interpolation, the probability given by the document language model to term *t is:
 


 

As before, a document’s score is given by the probability that it “generated” the query
 

• As before, this is given by multiplying the individual query-term probabilities
 

• However, the probabilities are obtained using the linearly interpolated language model
 

Without smoothing, the query-likelihood model ignores how frequently the term occurs in general!
 


 


 


 

Diving into Code


import nltk
import sys
import codecs
import nltk
from nltk.corpus import stopwords
import csv
import pandas
import re
import numpy as np


   
df = pandas.read_csv('C:/IR Course/Adv -IR/IATI10k.csv', header = 0, encoding="iso-8859-1")
df = df[df.description.notnull()]

def set_tokens_to_lowercase(data):
    for index, entry in enumerate(data):
        data[index] = entry.lower()
    
    return data

def remove_punctuation(data):
    symbols = ",.!"
    for index, entry in enumerate(symbols):
        for index2, entry2 in enumerate (data):
            data[index2] = re.sub(r'[^\w]', ' ', entry2)
            data[index2] = entry2.strip()
            
    return data

def remove_stopwords_from_tokens(data):
       stop_words = set(stopwords.words("english"))
       stop_words.add(" ")
       new_list = []
       for index, entry in enumerate(data):
               if entry not in stop_words:
                    new_list.append(entry)
       return new_list

def clean_df(pdf):
  
    for index, row in pdf.iterrows():
         row['description'] =  remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(row['description'].split())))  
         row['description'] = " ".join(x for x in row['description'])
    return pdf


def calc_docscore(pdf, pqry):
    col_names =  ['Description', 'score']
    f_df2  = pandas.DataFrame(columns = col_names)
    for index, row in pdf.iterrows(): 
        rank = []
        docscore = 0
        scored = score(row['description'])
        for word in pqry.split(" "):
            if word in scored.keys():
                rank.append(float(scored[word] )+ float(allcounts[word]/total)/2)
        
        if rank != []:
            docscore = np.prod(np.array(rank)) 
           
        f_df2.loc[index] = pandas.Series({'Description':row['description'], 'score':docscore})
    return f_df2
               
def score (pstr):
    fdict = {}
    flist = pstr.split()
    fdict = dict(nltk.FreqDist(flist))
    for  key, value in fdict.items():
        fdict[key] = round(fdict[key]/len(flist),2)
    return fdict

df = clean_df(df)
qry = "reduce transmission of HIV"
qry=  remove_stopwords_from_tokens(remove_punctuation(set_tokens_to_lowercase(qry.split())))
qry = " ".join(x for x in qry)      

allcounts = {} 
for descript in df['description']:
      tmp = dict(nltk.FreqDist(descript.split()))
      for key, value in tmp.items():
        if key not in allcounts:
            allcounts[key] = value
        else: 
            allcounts[key] = allcounts[key] + value
total = sum(allcounts.values())
df2=calc_docscore(df, qry) 
   
df2sort_by_score = df2.sort_values('score', ascending=False)
print (df2sort_by_score[1:20])
    
  
##                                             Description      score
## 136   goal project contribute reduction hiv incidenc...   0.121396
## 180   sa school-based sexuality hiv prevention educa...   0.121396
## 142   hiv prevention treatment professional sex work...   0.121396
## 143   hiv prevention treatment professional sex work...   0.121396
## 360   ?improving diabetes care prevention piloting h...   0.121396
## 167   reduce hiv/aids prevalence prevention identify...   0.120252
## 9537  unops helps procure retroviral drugs hiv progr...   0.101396
## 311   pace uganda sub mildmay hiv related project fu...   0.101396
## 211   objective program increase use evidence-based ...   0.101396
## 319   ?psi subreceipent project hope usaid award eth...   0.101396
## 281   sfh south africa/psi sub tb hiv care award hts...  0.0913958
## 65    psi kenya helping 3ie build evidence towards u...  0.0913958
## 38    overall goal reduce mortality morbidity due ma...  0.0902525
## 114   expand strengthen high quality community-based...  0.0813958
## 266   goal program scale-up strengthen delivery qual...  0.0813958
## 242   increased access universal hiv prevention serv...  0.0813958
## 316   using user-centered approaches increase adopti...  0.0813958
## 149   global fund project funded zimbabwe ministry h...  0.0802525
## 249   hiv identification treatment program lesotho -...  0.0713958


 

Exercises:
 

Suppose the document collection contains two documents:

\(d_1\): Xyzzy reports a profit but revenue is down
  \(d_2\): Quorus narrows quarter loss but revenue decreases further
 

The query is: “revenue down”
 

Calculate maximum liklihood estimates for terms in document 1 and document 2.
  Apply the linear interpolation and calulate the score for query/document 1 and query/document 2.
 

  1. Below are 4 mini documents, used previously.
     

D1: He likes to wink, he likes to drink
  D2: He likes to drink, and drink, and drink
  D3: The thing he likes to drink is ink
  D4: The ink he likes to drink is pink
  D5: He likes to wink, and drink pink ink
 

Query: “drink pink ink”
 

Write Python code to do the following:
 

  • lower case text, remove punctuation
     
  • apply linear interpolation to calculate score for document and query
     

3.0 Future of IR - Challenges Ahead

 

Data Volume
 

Amount of digital data in 2007: 281 exabytes = 281 trillion digitized novels
 

“Every 2 days now, we create as much information as we did from the dawn of civilisation up until 2003”
 

Eric Schmidt
 

Vocabulary mismatch problem due to synonymy and polysemy.
    The same word has different meanings.
  A search engine might not be able to guess the right meaning if appropriate contexts are not provided.
 

IR-systems are as good as the query provided to them.
    Queries are provided by the human, and human is the weak link in this chain.
  So, high quality query is a must. With a very bad query, you can defeat any search engine.
 

A search query is : Windows


For the query, a search engine (like- Google) can show results of three types as following:

Computer OS: Windows
  Windows of buildings
  Combination of both (1) and (2)
   

It is not the intention of a search engine to provide results of type 3, i.e., combined results of Windows of OS and buildings.
  Because, a user, who works in a building company, might not want Computer OS Windows as the output of the query.
  The output should be building windows for this type of users.
 

On the other-hand, similarly. another user working as a software engineer, should get the output of Windows OS for the query.
 

This type of query results based on person’s interests is called personalized search engines.
  It is one of the most challenging sides of Information Retrieval (IR) to provide results based on person’s interests and ranked the results accordingly.